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Abstract 

High  performance  for  numerical  linear  algebra  often  comes  at  the  expense  of  stability.  Computing 
the  LU  decomposition  of  a  matrix  via  Gaussian  Elimination  can  be  organized  so  that  the  computation 
involves  regular  and  efficient  data  access.  However,  maintaining  numerical  stability  via  partial  pivoting 
involves  row  interchanges  that  lead  to  inefficient  data  access  patterns.  To  optimize  communication  effi¬ 
ciency  throughout  the  memory  hierarchy  we  confront  two  seemingly  contradictory  requirements:  partial 
pivoting  is  efficient  with  column-major  layout,  whereas  a  recursive  layout  is  optimal  for  the  rest  of  the 
computation.  We  resolve  this  by  introducing  a  shape  morphing  procedure  that  dynamically  matches  the 
layout  to  the  computation  throughout  the  algorithm,  and  show  that  Gaussian  Elimination  with  partial 
pivoting  can  be  performed  in  a  communication  efficient  and  cache-oblivious  way.  Our  technique  extends 
to  QR  decomposition,  where  computing  Householder  vectors  prefers  a  different  data  layout  than  the  rest 
of  the  computation. 
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1  Introduction 


Do  we  need  to  trade  off  numerical  stability  for  high  performance?  This  has  been  the  most  important  question 
in  numerical  linear  algebra  for  at  least  20  years.  It  has  motivated  an  enormous  body  of  deep  research.  In  this 
paper  we  show  that  for  one  very  famous  computation  in  numerical  linear  algebra,  the  answer  is  no:  Gaussian 
Elimination  with  partial  pivoting  can  be  performed  in  a  communication  avoiding  way. 

High  performance  computers  do  not  resemble  simple  computational  models  like  the  RAM  model.  They 
rely  on  parallelism  and  complex  memory  hierarchies  to  deliver  high  performance.  In  the  past,  such  archi¬ 
tectures  were  confined  to  supercomputers,  but  today  they  arc  ubiquitous.  To  run  fast,  an  algorithm  must  be 
able  to  utilize  many  processors  concurrently  and  to  avoid  communication  as  much  as  possible. 

Out  of  all  the  effective  algorithms  for  a  given  problem,  only  a  subset  exhibits  high  levels  of  parallelism 
and  requires  little  communication  between  processors  and/or  between  levels  of  the  memory  hierarchy.  Does 
this  subset  always  contain  algorithms  that  arc  as  stable  as  the  best  ones  for  the  problem,  or  do  we  need 
to  trade  off  stability  for  high  performance?  Consider  Csanky’s  algorithm  for  matrix  inversion:  it  has  long 
been  a  classic  example  of  a  highly  parallel  but  highly  unstable  algorithm;  no  stable  algorithm  is  as  parallel. 
Twenty  years  ago,  one  of  the  authors  suggested  in  an  influential  paper  that  even  in  practice,  we  must  trade 
off  stability  in  return  for  useful  amounts  of  parallelism  [5],  That  paper  has  motivated  a  huge  amount  of 
research,  with  two  main  focal  points.  One  has  been  the  stability  of  so-called  fast  (Strassen-like)  algorithms; 
this  research  has  so  far  culminated  in  algorithms  that  are  stable  and  fast  in  theory,  but  it  remains  to  be  seen 
whether  they  arc  also  fast  in  practice  [4],  The  other  focal  point  has  been  in  algorithms  that  perform  as  little 
communication  as  possible,  culminating  in  the  definition  of  communication  avoidance  [3]  and  in  a  class  of 
algorithms  with  that  property. 

An  algorithm  is  called  communication  avoiding  if  it  performs  asymptotically  as  little  communication  as 
possible  in  two  metrics:  the  total  volume  of  data  measured  in  words  transferred  between  processors  or  levels 
of  the  memory  hierarchy  (the  bandwidth  it  consumes),  and  the  number  of  messages  or  block-transfers  that 
carry  this  data  (and  therefore  the  number  of  times  the  message  or  cache-miss  latency  impacts  the  execution). 
To  show  that  an  algorithm  is  communication  avoiding,  one  must  exhibit  a  communication  lower  bound.  For 
many  matrix  algorithms,  lower  bounds  of  the  form  <2  (j/y/M  j  have  been  established  on  the  total  volume  of 

communication  and  Q(f  /M1,5)  on  the  number  of  messages,  where  /  is  the  number  of  arithmetic  operations 
performed  by  the  algorithm  and  M  is  the  size  of  the  fast  memory  in  a  hierarchy  or  the  local  memory  in  a 
distributed  memory  parallel  computer  [3,  14]. 

Minimizing  the  volume  of  communication  while  preserving  numerical  stability  has  proved  relatively 
easy  for  many  problems.  For  Gaussian  Elimination  with  partial  pivoting  (using  the  largest-magnitude  ele¬ 
ment  in  a  column  to  eliminate  the  rest  of  the  column),  a  1997  algorithm  with  a  recursive  schedule  did  the 
trick  [13,  17]  for  the  sequential  (memory-hierarchy)  case;  this  algorithm  is  also  cache  oblivious,  in  the  sense 
that  its  schedule  does  not  depend  on  M. 

Minimizing  the  number  of  block-transfers  while  maintaining  stability  has  proved  much  harder.  The  first 
communication  avoiding  algorithm  for  Gaussian  Elimination  [11]  used  a  pivoting  rule  called  tournament 
pivoting  that  was  both  more  complicated  and  less  stable  than  partial  pivoting.  A  second-generation  commu¬ 
nication  avoiding  Gaussian  Elimination  algorithm  [15]  was  even  more  complicated,  but  also  more  stable. 
The  fundamental  challenge  that  required  the  new  pivoting  rules  is  that  partial  pivoting  steps  works  well 
when  the  matrix  is  stored  by  column,  whereas  updating  the  reduced  matrix  works  well  when  the  matrix  is 
stored  with  contiguous  blocks.  The  question  of  whether  the  simple,  elegant,  and  stable  partial  pivoting  rule 
can  be  used  in  a  communication  avoiding  algorithm  remained  open. 

In  this  paper  we  answer  this  question  in  the  affirmative  for  the  sequential  case  using  a  technique  we 
call  shape  morphing :  switching  the  data  layout  of  parts  of  the  matrix  back  and  forth  between  column-major 
layout  and  recursive  block-contiguous  layout.  Doing  so  allows  Gaussian  Elimination  to  access  contiguous 
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memory  locations  both  when  searching  for  a  pivot  down  a  column  and  applying  row  interchanges,  and  when 
computing  the  U  factor  and  updating  the  reduced  matrix  (Schur  complement).  The  shape  morphing  steps 
add  data  movement  overhead  to  the  algorithm,  but  we  show  that  the  overall  algorithm  remains  asymptotically 
efficient.  The  algorithm  is  recursive  and  also  cache-oblivious. 

The  same  technique  also  produces  communication  avoiding  algorithms  for  the  related  problem  of  QR 
factorization.  In  addition,  we  present  a  communication  efficient  algorithm  for  solving  a  triangular  system 
where  the  right  sides  form  a  rectangular  matrix.  This  subroutine  is  necessary  inside  SMLU,  but  is  also  used 
in  several  other  contexts. 

Algorithm  1  SMLU,  in  words.  See  Figure  1  and  Algorithm  8  for  further  details. 

if  one  column  then 

solve  the  problem  for  a  column 

end  if 

recursively  factor  the  left  half 
forward  permute 

reshape  everything  to  recursive  format 

update  right  half  with  triangular  solve  and  Schur  update 

reshape  everything  back  to  column  format 

recursively  factor  the  right  half 

back  permute 

combine  pivots 


Algorithm  2  SMQR,  in  words.  See  Figure  2  for  further  details. 

if  one  column  then 

solve  the  problem  for  a  column 

end  if 

recursively  factor  the  left  half 
reshape  everything  to  recursive  format 

update  right  half  with  triangular  and  general  matrix  multiplies 
reshape  right  half  back  to  column  format 
recursively  factor  the  right  half 
reshape  right  half  to  recursive  format 

compute  auxiliary  triangular  matrix  T  with  triangular  and  general  matrix  multiplies 
reshape  everything  back  to  column  format 


2  Machine  Model 

We  model  a  sequential  computer  as  having  an  infinite  slow  memory  and  a  finite  fast  memory  of  size  M. 
Ah  computation  takes  place  in  the  fast  memory,  and  we  consider  communication  between  the  fast  and  slow 
memory.  We  count  both  the  number  of  words  of  data  W  (or  bandwidth  cost)  and  the  number  of  messages  S 
( latency  cost)  transferred,  and  model  the  communication  time  as 

a- S  +  P-W, 

where  a  and  /?  are  machine-dependent  parameters.  There  is  one  more  parameter,  L,  which  is  the  size  of  the 
maximum  allowed  message  (or  block-transfer  size).  We  make  no  assumptions  on  the  size  of  L  beyond  the 
trivial  requirements  1  <  L  <  M. 
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It  is  instructive  to  contrast  our  model  to  the  ideal-cache  model  of  [10].  There,  the  authors  make  a  “tall 
cache”  assumption  that  M  =  Q(L2).  We  do  not  make  this  assumption,  so  latency  optimality  is  a  stricter 
requirement  in  our  model.  Additionally,  their  model  only  allows  messages  of  size  L,  which  is  equivalent  to 
setting  (5  =  0  in  our  model. 

One  may  also  consider  models  where  there  is  a  hierarchy  of  memories,  each  faster  and  smaller  than  the 
previous  one,  where  the  largest/slowest  memory  is  infinite  and  the  computation  occurs  only  in  the  small¬ 
est/fastest  memory,  and  one  wishes  to  minimize  the  communication  costs  across  every  level  of  the  hierarchy. 
A  cache-oblivious  algorithm  is  one  that  requires  no  tuning  based  on  the  machine  parameters  M  and  L.  An 
algorithm  that  is  cache-oblivious  and  communication-optimal  in  the  two-level  model,  such  as  the  SMLU 
algorithm  that  is  the  subject  of  this  paper,  is  also  communication-optimal  with  respect  to  every  level  of  any 
hierarchical  model. 


3  Data  Layouts 

We  consider  two  main  data  layouts:  column  major  (CM)  and  rectangular  recursive  (RR).  The  CM  layout  is 
the  layout  used  by  standard  libraries  like  LAPACK  and  stores  each  column  contiguously  with  elements  in  a 
column  ordered  from  top  to  bottom  and  columns  themselves  ordered  from  left  to  right.  The  RR  layout  is  a 
generalization  of  Morton  ordering  [16],  which  is  well-defined  for  square  matrices  with  dimension  a  power 
of  two. 

The  main  motivation  for  recursive  layouts  like  RR  is  that  they  map  well  to  recursive  algorithms:  at 
every  node  in  the  recursion  tree,  the  computation  involves  submatrices  which  arc  stored  contiguously  in 
memory.  The  RR  layout,  illustrated  in  Figure  lb,  corresponds  to  recursively  splitting  the  largest  dimension 
of  the  matrix  and  storing  each  of  the  two  submatrices  contiguously  in  memory.  Choosing  how  to  break 
ties  for  a  square  matrix  (choosing  whether  to  split  horizontally  or  vertically)  and  deciding  how  to  split  odd 
dimensions  leads  to  several  variations  of  the  RR  layout.  Here,  we  choose  to  split  square  matrices  into  left 
and  right  halves  because  that  corresponds  most  closely  to  the  CM  layout,  and  for  odd  dimensions,  we  choose 
to  assign  the  extra  row  to  top  halves  and  the  extra  column  to  left  halves.  The  latter  decision  is  arbitrary  but 
the  same  choice  must  be  made  throughout  the  algorithm.  When  applied  to  square  power-of-two  matrices, 
our  choices  lead  to  a  standard  M-Morton  ordering. 

There  arc  several  alternatives  for  generalizing  Morton  ordering  [7,  8,  9,  12].  The  simplest  approach  is  to 
pad  both  rows  and  columns  with  zeros  to  obtain  a  square  power-of-two  matrix.  However  this  can  increase 
the  number  of  matrix  elements  by  a  factor  of  4  times  the  ratio  of  large  dimension  to  small  dimension.  This 
approach  is  explored  in  [9],  where  the  authors  avoid  the  extra  space  and  computation  on  padded  rows  and 
columns  using  “decorations”  which  denote  full,  partial,  and  zero  submatrices.  Hybrid  layouts  are  also  often 
used,  storing  small  blocks  in  column  or  row-major  layout  and  ordering  the  blocks  using  a  Morton  ordering. 
One  can  view  our  RR  layout  as  the  “recursive  block  column  layout”  from  [7]  with  lxl  block  sizes. 

We  consider  another  alternative  for  generalizing  Morton  ordering  to  a  specific  class  of  rectangular  matri¬ 
ces.  If  the  smaller  dimension  of  a  rectangular  matrix  is  a  power  of  two  and  the  larger  dimension  is  a  multiple 
of  the  smaller  dimension,  then  the  matrix  can  be  divided  up  into  several  square  power-of-two  matrices.  In 
this  case,  the  elements  within  the  square  submatrices  can  be  stored  in  standard  Morton  ordering,  and  the 
squares  themselves  can  be  ordered  from  top  to  bottom  or  left  to  right.  This  layout  is  illustrated  in  Figure  la. 
For  the  purposes  of  LU  and  QR  factorizations,  if  the  original  matrix  is  square  with  power  of  two  dimension, 
then  all  submatrices  encountered  can  be  stored  in  this  layout.  To  preserve  generality  and  avoid  padding  the 
original  matrix,  we  describe  our  algorithms  with  the  RR  layout  instead  of  this  “stack  of  squares”  layout. 
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(a)  Stacks  of  squares  layout  for  a  12  x  4  matrix.  The 
3  square  4x4  blocks  are  stored  contiguously,  each  in 
Morton  order. 


(b)  Rectangular  recursive  (RR)  order  for  a  11  x  5  matrix. 
At  the  first  level,  the  top  6  rows  are  split  from  the  bottom 
5.  At  the  second  level,  the  top  6x5  block  is  split  into 
two  3x5  blocks,  whereas  the  bottom  5x5  block  is  split 
into  a  5  x  3  block  and  a  5  x  2  block. 
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Figure  1:  Cartoon  of  rectangular  recursive  algorithm  for  LU  [17].  Shaded  areas  correspond  to  computation. 
In  SMLU,  the  first  and  fourth  steps  assume  column-major  ordering,  and  the  second  and  third  steps  assume 
rectangular  recursive  ordering. 

4  Rectangular  Recursive  Algorithms  for  LU  and  QR 

Many  recursive  algorithms  for  linear  algebra  computations  are  cache-oblivious,  but  in  order  to  minimize 
latency  costs  the  data  layout  must  be  chosen  carefully.  Morton  ordering  works  very  well  for  the  recursive 
matrix  multiplication  algorithm,  where  the  eight  recursive  subproblems  involve  matrix  quadrants.  The  nat¬ 
ural  extension  of  Morton  ordering  to  symmetric  matrices  also  maps  nicely  to  the  square  recursive  algorithm 
for  Cholesky  decomposition  [1,  2,  13].  In  this  algorithm,  subroutines  and  recursive  subproblems  involve 
matrix  quadrants  (which  may  be  symmetric,  triangular,  or  dense). 

For  LU  decomposition,  the  analogous  square  recursive  algorithm  (and  standard  Morton  ordering)  is  not 
sufficient.  In  order  to  maintain  numerical  stability,  row  (and  possibly  column)  interchanges  are  necessary. 
Partial  pivoting,  the  most  common  scheme,  involves  at  each  step  of  the  algorithm  selecting  the  maximum 
element  in  absolute  value  in  a  column  and  interchanging  the  corresponding  row  with  the  diagonal  element’s 
row.  For  this  reason,  the  square  recursive  algorithm  for  Cholesky  does  not  generalize  to  nonsymmetric  ma¬ 
trices:  the  top  left  quadrant  of  the  matrix  cannot  be  factored  without  accessing  (and  possibly  interchanging) 
rows  from  the  bottom  left  quadrant  of  the  matrix. 

In  order  to  respect  the  column-access  requirement  of  partial  pivoting,  Toledo  [17]  and  Gustavson  [13] 
developed  a  “rectangular  recursive”  algorithm  which  recursively  splits  the  matrix  into  left  and  right  halves 
instead  of  quadrants.  The  steps  of  the  computation  are  shown  in  Figure  1 .  Given  an  m  x  n  input  matrix, 
recursive  subproblems  are  of  size  m  x  ^  and  (m  —  x  ”,  and  algorithms  for  triangular  solve  with  multiple 
right  hand  sides  (TRSM)  and  matrix  multiplication  are  used  as  subroutines.  Because  the  recursion  splits  the 
matrix  into  left  and  right  halves,  the  base  of  the  recursion  consists  of  factoring  single  columns  with  partial 
pivoting:  finding  the  maximum  element,  swapping  it  with  the  diagonal,  and  scaling  the  column  with  its 
reciprocal. 

A  similar  algorithm  for  QR  decomposition  was  developed  by  Elmroth  and  Gustavson  [6].  The  standard 
Householder  QR  algorithm  works  column-by-column,  computing  a  Householder  vector  that  annihilates  all 
subdiagonal  entries  in  the  column  and  applying  the  orthogonal  transformation  to  the  trailing  matrix.  In 
order  to  compute  one  Householder  vector  per  column,  a  rectangular  recursive  algorithm  is  necessary  so  that 
the  base  of  the  recursion  consists  of  computing  a  single  Householder  vector  to  annihilate  the  entire  column 
below  the  diagonal.  The  basic  steps  of  the  computation  are  shown  in  Figure  2.  In  the  rectangular  recursive 
QR  algorithm,  an  auxiliary  triangular  matrix  T  is  computed  so  that  the  update  of  the  hailing  matrix  can  be 
done  with  matrix  multiplication. 

Abandoning  the  requirement  that  the  orthogonal  factor  Q  be  computed  with  one  Householder  vector  per 
column  allows  for  a  square  recursive  algorithm  for  QR  [9].  The  square  recursive  algorithm  maps  nicely 
onto  standard  Morton  ordering,  as  each  computation  involves  matrix  quadrants.  However,  because  the 
orthogonalization  is  based  on  many  Givens  rotations  per  column  instead  of  one  Householder  vector  per 


5 


Figure  2:  Cartoon  of  rectangular  recursive  algorithm  for  QR  [6].  Shaded  areas  correspond  to  computation. 
The  triangles  correspond  to  the  intermediate  T  factor.  In  SMQR,  the  first  and  fourth  steps  assume  column- 
major  ordering,  and  the  second  and  third  steps  assume  rectangular  recursive  ordering. 


Figure  3:  One  recursive  step  in  converting  from  column-major  to  rectangular  recursive  order. 


column,  the  standard  trailing  matrix  update  techniques  do  not  apply.  The  approach  from  [9]  is  to  explicitly 
construct  the  orthogonal  factor  Q,  using  matrix  multiplication  to  update  the  trailing  matrix.  This  technique 
leads  to  an  increase  in  the  total  flop  count  of  the  decomposition  compared  to  the  standard  algorithm,  by  a 
factor  of  approximately  3  x . 

By  using  shape  morphing,  we  show  that  the  rectangular  recursive  algorithm  of  Elmroth  and  Gustavson 
[6]  can  maintain  the  standard  format  of  representing  the  orthogonal  factor  by  its  Householder  vectors  (one 
per  column)  and  still  achieve  cache-obliviousness,  minimizing  both  words  and  messages.  The  rectangular 
recursive  algorithm  also  increases  the  flop  count  with  respect  to  the  standard  algorithm,  by  about  17%  for 
tall  skinny  matrices  and  about  30%  for  square  matrices.  To  limit  the  increase  in  computation,  one  can  use 
a  hybrid  algorithm,  using  the  rectangular  recursive  algorithm  on  panels  of  sufficiently  small  width.  Since 
this  tuning  parameter  prevents  the  algorithm  from  begin  cache-oblivious,  we  do  not  consider  the  hybrid 
algorithm  here. 
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5  Subroutines  and  Their  Communication  Costs 


5.1  Converting  Rectangular  Recursive  to  Column  Major  and  Back 

The  algorithm  for  reshaping  a  matrix  from  column-major  order  to  rectangular  recursive  order  is  provided  in 
Algorithm  4.  The  algorithm  is  recursive;  at  each  step  it  splits  the  matrix  along  its  largest  dimension  and  is 
then  recursively  called  on  both  submatrices.  When  the  input  is  short  and  fat  (m  <  n),  splitting  the  matrix 
does  not  require  any  data  movement,  since  in  column-major  order  the  left  and  right  halves  of  the  matrix  arc 
already  contiguous.  When  the  input  is  tall  and  skinny  (m  >  n),  splitting  the  matrix  requires  "separating” 
each  column  into  its  top  and  bottom  halves.  We  perform  this  operation  with  the  Separate  function:  since 
it  involves  contiguously  streaming  through  the  input  and  contiguously  writing  to  two  output  locations,  the 
communication  cost  is  0(mn )  words  and  O  (^)  messages,  as  illustrated  in  Figure  3.  The  recurrence  for 
the  communication  cost  is  therefore 

{2Reshape(m/2,  n)  +  0{{mn/ L)a  +  mn/3)  m  >  n  and  mn  >  M 
2Reshape(m,  n/2)  m  <  n  and  mn  >  M  , 

0((mn/L  +  l)a  +  mn/3)  mn  <  M 

with  solution 

^  .  ,  .  _  /  /  mn  mn  \  /  mn  \  \ 

Reshape(n,  m)  =  (J  log  +  1 J  a  +  mn  ^iog  +  1 J  P j  ■ 

Reshaping  from  rectangular  recursive  order  to  column-major  order  is  described  in  Algorithm  10,  and 
has  identical  costs. 


Algorithm  3  (^  )  =  Separate( A, m,n, m{) 

Input:  A  is  m  x  n  in  column-major  order 

Output:  Ai  is  the  first  m\  rows  of  A  in  column-major  order,  A 2  is  the  remaining  m  —  mi  rows  of  A  in  column-major  order 
for  j  in  l:n  do 

Ai(l  :  mi,j)  =  A(  1  :  mi,j) 

^2(1  :  m  —  mi,j)  =  A(m\  +  1  :  m,j) 

end  for 


Algorithm  4  B  =  R e  s h  a p e To  R e c  u  i ' s  i  v e ( ,4  ,m ,n ) 


Input:  A  is  m  x  n  with  m  >  n  in  column-major  order 
Output:  B  is  the  same  matrix  in  rectangular  recursive  order 
if  m  =  n  =  1  then 
B(1,1)  =  A(1,1) 

return 
end  if 

if  m  >  n  then 


m\  =  [m/2],  m2  =  |_m/2j 
BiN 


B2 


=  Separate(,4,m,n,7ni) 


B  i  =  ReshapeToRecursive(Bi,mi,n) 
B2  =  R  c  s  h  a|ic  To  R  cc  li  rs  i  vc  ( -7^2 ,/-) 


B  = 


B2 


else 


ni  =  [n/2],  712  =  |_n/2j 
(Si  B2)  =  A 

B\  =  ReshapeToRecursive(J3i,ra,77-i) 
B2  =  ReshapeToRecursive(52,m,7i2) 
B  =  (Si  B2) 

end  if 
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5.2  Rectangular  Matrix  Multiplication 


The  SMLU  algorithm  requires  a  recursive  matrix  multiplication  algorithm  for  square  matrices  stored  in 
rectangular  recorsive  order.  The  communication  costs  of  recursive  matrix  multiplication  were  first  analyzed 
in  [10].  In  our  model  they  arc  worked  out  in  [2]  and  arc 

, .  _  /  /  mnk  mn  +  mk  +  nk 

GEMM(m,  n,k)^0[[-m  + - j - 

where  m,  n,  k  arc  the  three  matrix  dimensions.  For  completeness,  the  algorithm  for  rectangular  recursive 
layout  appears  in  Algorithm  1 1 . 

5.3  Rectangular  Triangular  Solve 

The  SMLU  algorithm  requires  a  recursive  triangular  solve  on  matrices  stored  in  Morton  order.  An  algorithm 
for  square  matrices  with  optimal  communication  costs  is  given  in  [2].  In  Algorithm  5  we  generalize  to  the 
case  of  rectangular  matrices.  Let  A  be  an  m  x  n  matrix,  and  L  he  a  m  x  rn  unit  lower  triangular  matrix.1 
At  each  recursive  step,  split  the  larger  of  m  and  n.  Splitting  m  gives  two  recursive  calls  to  TRSM  and  one 

call  to  matrix  multiplication.  Splitting  n  gives  two  recursive  calls  to  TRSM.  Thus  the  communication  cost 

recurrence  is: 

{2TRSM(m/2,  n)  +  GEMM(m,  m,  n)  m>  n  and  2  mn  +  m2  >  M 

2TRSM(m,  n/2)  n  >  m  and  2 mn  +  m2  >  M  , 

O  ((( mn  +  m2)/L  +  1  )a  +  (mn  +  m2)/3)  2 mn  +  m?  <  M 

with  solution 

^  f  (  m2n  mn  +  m 
TRSM(m,„)  =  0((z-7=  +  — — 


+  1  )  Oi  + 


m2n 


\[M 


+  mn  +  m2  )  j3 


+  1  )  O'  +  ( 


(  mnk 


Wm 


+  mn  +  mk  +  nk)  (5 


5.4  Pivoting 

The  SMLU  algorithm  returns  a  pivot  vector  p  of  length  m,  where  p(i)  =  j  indicates  that  row  j  in  the  original 
matrix  has  been  pivoted  to  row  i  in  the  output.  Two  subroutines  arc  required  to  manage  the  pivoting. 

First,  ApplyPivots,  presented  as  Algorithm  6,  applies  a  pivot  vector  to  a  matrix.  It  applies  the  pivot 
vector  to  each  column  of  the  matrix  in  sequence.  For  each  column,  it  applies  the  pivot  vector  recursively  by 
streaming  through  the  entire  column  to  separate  entries  between  those  that  belong  in  the  the  top  half  from 
those  that  belong  in  the  bottom  half  of  the  permuted  column,  the  calling  itself  on  both  the  top  and  bottom 
halves.  If  m  <  M,  at  least  one  column  fits  into  memory  and  ApplyPivots  needs  to  read  the  matrix  only 
once.  If  m  >  M,  it  reads  and  writes  each  column  log(m/M)  times.  The  communication  costs  are 

—j—  (^1  +  log  —  J  +  lj  a  +  mn  |^1  +  log  —  J  (5 J  . 

It  is  also  necessary  to  combine  two  pivot  vectors  into  one,  which  is  done  by  CombinePivots,  presented 
in  Algorithm  7.  This  is  accomplished  by  two  calls  to  ApplyPivots  with  n  =  1,  so  the  communication  costs 
are 

CombinePivots (m)  =  O  ({^  a  +  m  ^1  +  log  . 


'A  non-unit  lower  triangular  matrix  changes  only  the  base  case  computation. 


Algorithm  5  U  =  R  c  c  T  R  S  M  (44 ,  L ,  rn , n ) 

Input:  A  is  m  x  n,  L  is  m  x  m  and  unit  lower  triangular,  both  in  rectangular  recursive  layout 
Output:  U  =  L— 1 A  in  rectangular  recursive  layout 
if  rn  =  n  =  1  then 
[/(1,1)  =  A(1,1) 

return 
end  if 

if  m  >  n  then 

mi  =  [m/2],  m2  =  |_m/2j 
(L^  \  _  r 


C/ 1  =  RecTRSM(t/i,Ln,mi,n) 

U2  =  RecGEMM(L2i,f/i,f/2,m2,mi,n) 

C/2  =  RecTRSM(C/2, 1/22^2^) 

TT-  fUl\ 


n\  =  [n/2],7i2  =  Ln/2J 

(c/i  c/2)  =  u 

Ui  =  RecTRSM(C/i,L,m,ni) 
U2  —  RecTRSM(C/2,-C/,m,ri2) 
c/  =  (f/i  (72) 

end  if 


Algorithm  6  A  p p  1  y  P i  v o t s ( A ,  f5 

Input:  A  is  m  x  n  in  column-major  order,  P  is  a  pivot  vector 
Output:  The  rows  of  A  are  pivoted  according  to  P 
if  m  =  n  =  1  then 
return 
end  if 

if  n  =  1  then 

mi  =  [m/2],  m2  =  |_m/2j 
ci  =  new  array  of  length  mi 
C2  =  new  array  of  length  m2 
Pi  =  new  array  of  length  mi 
P2  =  new  array  of  length  m2 
j  =  1;  k  =  1 
for  i  in  1  :  n  do 

if  thenP(z)  <  mi 
Cl  (j)  =  A(i) 

Pi(j)  =  p(*)i  =  3  + 1 

else 

C2O)  =  A(i) 

P2O)  =  P(*)  —  mi  fc  =  fc  +  1 

end  if 
end  for 

ApplyPivots(ci  ,Pi  ,mi ,  1 ) 

ApplyPivots(c2  ,P2  ,m2 , 1 ) 

else 

ni  =  |"n/2],n2  =  Ln/2J 
(Aj  A2)  =  A 
ApplyPivots(  Ai  ,P,m,ni ) 

Apply  Pivots(  A2  ,P,m, n2 ) 

end  if 
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Algorithm  7  P  =  CombinePivots(P£,PR, 

Input:  Pl  ,  Pr  are  left  and  right  pivot  vectors 
Output:  P  is  the  combined  pivot  vector 

//  Convert  the  size  of  the  right  pivot  vector 
k  =  mL  —  mR 
P'R  =  new  vector  of  length  in  / 

P'R(  1  :k)  =  1  :  k 
P'R(k  +  1  :  i nL)  =  PR  +  k 

//  Combine  pivots 

Pj  =  Apply Pivots(l  :  m^.P^ ,ra£,,l) 

P  =  Apply  Pivots(P^,P/,  rriL,  1) 


5.5  Analysis  of  SMLU 

Detailed  pseudocode  for  SMLU  appeal's  in  Algorithm  8.  Each  call  to  SMLU  has  two  recursive  calls  to 
itself,  two  calls  to  ApplyPivots,  four  calls  each  to  ReshapeToRecursive  and  ReshapeToColMajor,  one  call  to 
RecTRSM,  one  to  RecGEMM,  and  one  call  to  CombinePivots.  The  recursive  communication  costs  are  thus 

SMLU(m,n)  <  2SMLU  (jn,  ^  +  2ApplyPivots  (m,  Pj  +  8Reshape  (jn,  Pj  +  TRSM  Pj  + 

(Tl  71  \ 

+  CombinePivots(m) 

^SMLu(™,=)+0((=g+™„log=+™„)(.^)). 

If  M  <  m,  one  column  of  the  matrix  does  not  fit  in  fast  memory,  so  the  base  case  costs  are  SMLU(1,  m)  = 
m  (/?+£))•  If  M  >  in,  then  M/m.  columns  fit  into  fast  memory  at  once,  so  the  base  case  costs  are 
SMLU  ( ,  mj  =  A/  (U  +  /).  The  solution  to  the  recurrence  is 

,  /  Otff^+mnlog  ^logn)  (P  +  l)  +  a)  M<m 

1  °  ((fu2  +  mn  (log  ^)2  +  mnj  (/?  +  f )  +  a)  M  >  m 

Recall  that  the  communication  lower  bound  for  LU  [3],  which  is  attainable  for  LU  without  pivoting,  is 

LU (m,n)  =  n(J^PP  +  rnnj  (z3  +  j)  +  • 


Compared  to  this  lower  bound,  SMLU  has  an  extra  polylogarithmic  factor  on  the  mn  term.  In  the  square 
case,  m  =  n,  SMLU  asymptotically  matches  the  lower  bound  except  in  the  tiny  range 


(log(mn))4 


<  M  <  n2. 


In  the  rectangular  case,  SMLU  may  be  larger  than  the  lower  bound  by  a  logarithmic  factor  in  a  larger  range 


(log(mn))4 


<  M  <  mn. 


Compared  to  the  original  rectangular  recursive  algorithm  for  LU  [13,  17],  with  partial  pivoting  but  without 
shape  morphing,  SMLU  has  a  communication  cost  with  an  extra  log (mn/M)  on  the  mn  term.  Thus,  outside 
the  ranges  given  above,  shape  morphing  does  not  increase  the  communication  costs  asymptotically. 
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Algorithm  8  P  =  SMLLJf  A.  m,  n) 

if  n  =  1  then 

P  =  1  :  m 
i  =  ArgMax(|A|) 

Swap(A(l),  A(i)) 

Swap(P(l),  P(i)) 

Scale(A(2  :  m),  1  /j4(1)) 


//  set  submatrix  dimensions 


ri2  =  n  —  n\ 
mi  =  m 
m2  =  m  —  mi 

//  recurse  on  left  half 
(Ai  A2)  =  A 
Pl  =  SMLU(Ai,  m,  ni) 

//  forward  pivot 
Apply  Pivots  (A2,  P]^,m,  712) 

//  separate  top  mi  rows  from  bottom  m2  rows 

J  =  Separate(Ai,m,ni,mi) 

J  =  Separate(A2,m,ri2,mi) 

//  convert  each  quadrant  to  Morton  ordering 
ReshapeToRecursive( A\  1 ,  mi ,  n\ ) 
ReshapeToRecursive(Ai2,  mi ,  712) 
ReshapeToRecursive(A2i ,  m2,  ni) 
ReshapeToRecursive(A22?  wi2,  ^2) 

//  triangular  solve  with  Morton  ordered  arrays 
A12  =  RecTRSM(Ai2,  An,  ni,  712) 

//  Schur  update  with  Morton  ordered  arrays 
A22  =  RecGEMM(A2i,  A12,  A22,  ^2,  ^1,  ^2) 

//  convert  quadrants  back  to  column  major 
An  =  ReshapeToColMajor(An ,  mi ,  n\) 

A12  =  ReshapeToColMajor(Ai2,  mi,  712) 

A21  =  ReshapeToColMajor(A2i ,  m2 ,  n\ ) 

A22  =  ReshapeToColMajor(A22,  ^2,  ^2) 

//  recurse  on  (bottom  of)  right  half 
Pr  =  SMLU(A22,  m2, 712) 

II  back  pivot 

ApplyPivots(A2i ,  Pr,  m2,  ni) 

//  combine  pivots 

P  =  CombinePivots(i-l,  Pr,  m,  m2) 

A  =  Combine((An  Ai2),(A2i  A22)  ,m,7i,mi) 

end  if 
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A  Supplementary  Algorithms 


Algorithm  9  A  =  Combine(Ai,A.2,m,n,mi) 

Input:  A\  is  m\  x  n  and  A 2  is  m  —  m±  x  n  both  in  column-major  order 
Output:  A  is  m  x  n,  the  first  m\  rows  are  from  A\  and  the  remaining  rows  are  from  A 2 
for  j  in  l:n  do 

A(1  :  mi,  j)  =  A\{1  :  mi,j) 

A(m\  +  1  :  m,j)  =  ^2(1  :  m  —  m\ ,  j) 

end  for 


Algorithm  10  B  =  R e s h  a p e To C o  I M  aj  o r ( A ,  n ,  rn ) 

Input:  A  ism  x  n  with  m  >  n  in  rectangular  recursive  order 
Output:  B  is  the  same  matrix  in  column-major  order 
if  m  =  n  =  1  then 

B(l,l)  =  A(l,l) 

return 
end  if 

if  m  >  n  then 

mi  =  [m/2],  m2  =  |_m/2j 


B\  =  ReshapeToColMajor(i?i,rai,n) 
£?2  =  ReshapeToColMajor(i?2,ra2,n) 
B  =  Combine(Bi,B2,m,n,mi ) 

else 

721  =  [72/2],  722  =  [n/2\ 

(Si  B2)  =  A 

B\  =  ReshapeToColMajor(Si,ra,ni) 
B2  =  ReshapeToColMajor(S2,m,ri2) 
B  =  (/#!  B2) 

end  if 
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Algorithm  11  C  =  RecGEMM(A,B,  C,m,k,n) 

Input:  A  is  in  X  k,  B  is  k  X  n,  C  is  in  X  n,  all  in  rectangular  recursive  layout 
Output:  C  =  C  —  A  ■  B 

if  m  =  0  or  n  =  0  or  k  =  0  then 
return 
end  if 

if  rn  n  k '  1  then 

C(  1, 1)  =  <7(1, 1)  -  A(  1, 1)  •  B(  1, 1) 

return 
end  if 

if  k  >  m  and  k  >  n  then 
A-i  =  A'/2  ,  k-2  =  [k/2\ 

(Ax  A2)  =  A 


RecGEMM(  A\  ,B\  ,C,m,n,ki ) 
RecGEMM(  A2  ,B 2  ,C,m,n,k2 ) 

end  if 

if  m  >  k  and  m  >  n  then 

mi  =  [m/2],  m2  =  Lm/2j 


RecGEMM(Ai  ,B,Ci  ,m\,n,k ) 
RecGEMM(  A2  ,B,C2  ,m2  ,n,  k) 


end  if 

if  n  >  k  and  n  >  m  then 
m  =  |"n/2] ,  n2  =  |n/2j 
(Bi  B2)  =  B 
(Ci  C2)  =  C 
RecGEMM(A,Bi  ,Ci  ,m,ni  ,k ) 
RecGEMM(A,5i,C2,m,ni,/c) 

c  =  (Cl  Cl) 

end  if 
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